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Massively parallel sequencing technology and the associated rapidly decreasing sequencing costs have enabled systemic 
analyses of somatic mutations in large cohorts of cancer cases. Here we introduce a comprehensive mutational analysis 
pipeline that uses standardized sequence-based inputs along with multiple types of clinical data to establish correlations 
among mutation sites, affected genes and pathways, and to ultimately separate the commonly abundant passenger 
mutations from the truly significant events. In other words, we aim to determine the Mutational Significance in Cancer 
(MuSiC) for these large data sets. The integration of analytical operations in the MuSiC framework is widely applicable to 
a broad set of tumor types and offers the benefits of automation as well as standardization. Herein, we describe the 
computational structure and statistical underpinnings of the MuSiC pipeline and demonstrate its performance using 316 
ovarian cancer samples from the TCGA ovarian cancer project. MuSiC correctly confirms many expected results, and 
identifies several potentially novel avenues for discovery. 



[Supplemental material is available for this article.] 

The continued advancement of DNA sequencing technologies 
(Mardis 2011) now allows for the rapid sequencing of large sets of 
cancer cases (matched tumor and normal samples) for the purposes 
of mutation discovery. This technological progression has shifted 
the emphasis in cancer genomics from the analysis of a single pa- 
tient sample to that of hundreds of patient samples across a broad 
range of tumor types. Such an expansion of scope facilitates the 
ascertainment of recurrent mutations within genes and functional 
pathways. Additionally, the increasing scope permits correlating 
mutations and pathways with clinical phenotypes where appro- 
priate clinical data exist. The outcome of such correlation can in- 
clude the identification of prognostic or diagnostic markers or the 
identification of actionable targets for developing therapeutic 
options that may inform clinical trials development. 

To this end, we present a packaged suite of comprehensive, 
user-friendly tools designed to determine mutational significance 
in cancer (MuSiC). The primary goal of MuSiC is to separate the 
significant events which are likely drivers for disease from the 
passenger mutations present in mutational discovery sets using 
a variety of statistical methods. This package provides imique prac- 
tical advantages over existing software and requires a few basic 
input elements: mapped reads in BAM format, predicted or val- 
idated single nucleotide variants (SNVs) and indels in mutation 
annotation format (MAF), a set of regions of interest (typically the 
boundaries of coding exons), and any relevant numeric and/or cat- 
egorical clinical data. Usage is straightforward. With a single com- 
mand, a user can (1) apply statistical methods across the cohort to 
identify significantly mutated genes and (2) identify significantly 
altered pathways and gene sets, (3) investigate the proximity of 
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amino acid mutations within the same gene, (4) search for gene- 
based or site-based relationships and correlations between the 
mutations themselves, (5) correlate mutations to clinical features, 
and (6) cross-reference the findings with relevant databases, such 
as Pfam (Finn et al. 2010), COSMIC (Forbes et al. 2008), and OMIM 
(McKusick 1998). These functions can be accessed individually or 
run as an automated serial implementation. 

To illustrate performance, we tested the MuSiC suite on an 
exome capture data set consisting of 316 cases of ovarian carci- 
noma (OV) that were previously described by the TCGA Research 
Network (The Cancer Genome Atlas Research Network 201 1). The 
original analysis provided statistically supported lists of signifi- 
cantly mutated genes, therapeutically targetable copy number 
amplifications in several genes (e.g., MECOM, MAPKl, CCNEl, emd 
KRAS), evidence of overlaps between DNA methylation clusters 
and gene expression subtypes, and confirmation of involvement 
of the NOTCH and FOXMl signaling pathways in serous ovarian 
cancer pathophysiology. Using MuSiC, we complement the previous 
results with detailed descriptions of the correlations between muta- 
tion spectra and clinical data. We found evidence of mutual exclu- 
sion between mutations in two major tumor suppressors, TP53 and 
RBI, and strong statistical support for the correlation of germUne 
BRCAl point mutations and age at disease onset (Hall et al. 1990; 
Miki et al. 1994). MuSiC also helps us confirm the importance of 
mutations in the P53 DNA-binding domain (Sigal and Rotter 2000). 

Results 

Overview 

The development of MuSiC was motivated by the rapidly 
expanding numbers of mutation data sets from a wide variety of 
tumor types. It is imperative during post-discovery analysis to 
separate the significant, or "driver," mutations from the passenger 
mutations to more accurately pinpoint the key genes and pathways 



22:1589-1598 © 2012, Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/12; www.genome.org 



Genome Research 1589 

www.genome.org 



Dees et al. 



critical for disease initiation and progression. MuSiC is designed 
precisely to streamline this process into an easily accessible high- 
throughput software exercise. 

MuSiC currently consists of seven analysis modules and an 
eighth execution module, "MuSiC Play," which runs each analysis 
module sequentially (Fig. 1). MuSiC Play parses the input and out- 
put of each of the individual modules and then produces a com- 
posite summary of all executed modules. Table 1 lists the type of 
analysis performed and the types of variants considered by each 
individual MuSiC module. More detailed descriptions of the specific 
analysis algorithms performed by each module are given below. 

Significantly mutated gene tests 

We use the concept of "significantly mutated genes" (SMG) to 
describe genes that show a significantly higher mutation rate than 
the background mutation rate (BMR) when multiple mutational 
mechanisms (coding indel and single nucleotide substitution, 
splice site mutation, etc.) are considered. Specialized measurements 
of the BMR may also be considered; BMRs in MuSiC are optionally 
calculated across the entire sample set, across particular subgroups 
of similarly mutated samples, or for each sample individually. For 
each BMR subgroup considered and for each category of mutational 
mechanism, the mutation rates are compared to the appropriate 
BMR, and a single P-value summarizing all considerations is gen- 
erated for each gene. We refer to this summarization procedure as 
the significantly mutated gene (SMG) test. 

We assessed multiple methods of calculating summarized 
P-values, including a convolution test (CT), a Fisher's combined 
P-value test (FCPT), and the likelihood ratio test (LRT), using 
a partially simulated data set (this data set and the associated test 
simulations are described in the Supplemental Material). By this 
approach, we determined that the P-value distribution obtained 
using the CT method most closely resembled the uniform distri- 
bution expected under the null (in this case, the null is such that 
no gene is truly significantly mutated), while the FCPT and LRT 
methods produced slightly inflated or deflated P-values, respec- 
tively (Supplemental Fig. SI). During the SMG test, a false discovery 
rate (FDR) also is calculated. We evaluate our SMG test results by 
establishing a P-value or FDR threshold (threshold typically 0.2 or 
less for FDR), and then appropriately filtering the test output. 

The results of MuSiC's SMG analysis for the ovarian cancer 
data set were previously reported (The Cancer Genome Atlas Re- 
search Network 2011). Briefly, there were 12 genes found to be 
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Figure 1. MuSiC flow diagram. MuSiC modules can either be imple- 
mented individually with various required input files or may be imple- 
mented in serial via one command where four inputs are used to execute 
the entire package of tools. 



"In all of the tools, a user may optionally include only nonsynonymous 
variants, or, alternatively, both nonsynonymous and synonymous variants 
may be considered. 



significantly mutated in the data set. The CT, FCPT, and LRT P-values 
for these genes as well as the BMR for each mutational mechanism 
category in the ovarian data set are displayed in Figure 2. BRCAl 
and BRCA2 are known ovarian cancer risk genes (King et al. 
2003; Pal et al. 2005). In addition to 27 (BRCAl) and 25 (BRCA2) 
germline nonsense, splice site, and indel mutations, 11 and 10 
nonsynonymous somatic mutations were discovered in this data 
set in BRCAl and BRCA2, respectively (The Cancer Genome Atlas 
Research Network 2011). 

Significantly mutated patliway/gene set analysis 

To identify known cellular pathways with significant accretions of 
somatic mutations in ovarian tumors, we integrated the PathScan 
algorithm (Wendl et al. 2011) as a module of the MuSiC pipeline. 
PathScan treats pathways as groups of genes defined by databases 
such as KEGG (Kanehisa and Goto 2000), BioCarta (Nishimura 
2001), and Reactome Goshi-Tope et al. 2005), with the KEGG 
definitions currently set as the default implementation. PathScan 
can be configured, however, to assess any grouping of genes, including 
groupings from nonpathway databases such as Pfam (Finn et al. 2010). 

Using PathScan, we analyzed the OV somatic mutation data 
set in two ways. First, the entire data set was analyzed regardless 
of the frequency of mutation in specific genes. Secondly, due to the 
overwhelming abundance of TP53 mutations, we also performed the 
analysis using identical parameters but excluding TP53 mutations. 

The most significant pathways identified in the first analysis 
were a collection of KEGG cancer pathways including "Thyroid 
Cancer" (hsa05216), "Bladder Cancer" (hsa05219), "Basal Cell Car- 
cinoma" (hsa05217), "Nonsmall Cell Lung Cancer" (hsa05223), and 
"Melanoma" (hsa05218). In the midst of those significant cancer 
pathways sits the "p53 Signaling" pathway (hsa04115) at aP-value 
of 2.62 X 10"^^^. Also, MuSiC found the "Apoptosis" pathway 
(hsa04210), including not only rP53 mutations but also nine 
phosphoinositide 3-kinase mutations, to be affected. This latter 
group of mutations includes two PIK3CA mutations, previously 
implicated in both breast and ovarian cancers (Levine et al. 2005). 

In the second analysis where TP53 mutations were excluded, 
the collection of KEGG cancer pathways was no longer identified 
as the most significant pathways in the OV data set. Instead, for 
instance, this analysis identified the environmental information 
processing class "Receptors and Channels" pathway from the KEGG 
Brite database (hsa04000) as the most significant (P = 4.36 x 10"^^) 
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RYRl and RYR2 genes, whose expression 
has been correlated with tumor grade In 
breast cancer (Abdul et al. 2008). 



Figure 2. Mutation rates and SMGs in tfie OV data set. (A) The cohort-wide bacl<ground mutation 
rates for all seven mutational mechanism categories are plotted for the OV data set. The overall BMR is 
also plotted, combining all types of mutations. (6) -Logio(P)for the top 1 2 OV SMGs are plotted for all 
three SMC tests in order of decreasing convolution test P-value. 



(Table 2). Limiting the analysis scope to the KEGG Pathway data- 
base, a similar pathway was Identified from the environmental 
Information processing class, the "Neuroactlve Llgand-Receptor 
Interaction" pathway (hsa04080). This pathway Incurred 266 
mutations across the OV data set yielding a P-value of 2.5 x 10"^^. 
The "Calcium Signaling" pathway (hsa04020, P = 4.9 x 10"**) also 
rose to the level of significance from the TP53-excluded KEGG 
analysis, which Is Interesting due to the role of calcium signaling In 
many cellular processes Including cell death (Crompton 2000). The 
results from the OV data set feature 266 mutations throughout this 
pathway, highlighted by 35 mutations In voltage-dependent calcium- 
channel genes (CACNAIA-H and CACNAIS), and 25 mutations In 



Mutation relation test 

The mutation relation test (MRT) at- 
tempts to reveal correlations and mutual- 
exclusion relationships among significantly 
and highly mutated genes In a palrwlse 
fashion. Positive correlations suggest that 
mutations and their associated pathways 
putatlvely function synerglstlcally to pro- 
mote carcinogenesis, while negative corre- 
lations Imply that the alteration of a single 
component or pathway may be sufficient, 
wholly or In part, for carcinogenesis. 

An example heat map of the MRT 
analysis for all 316 OV samples Is shown 
In Figure 3. For this data set, we found 
examples of both concurrent mutations 
and also mutually excluded mutations among the genes repre- 
sented In the heat map. Co-mutators FATS and EMR3 (P = 0.0333) 
are both members of the "Receptors and Channels" KEGG path- 
way (hsa04000). Identified as significantly mutated In the pathway 
analysis described above. And there Is also mild evidence of the 
mutual exclusion of RBI and TP53 mutations (P = 0.0141). Both of 
these genes are tumor suppressors, and both were found previously 
to be significantly mutated genes In this data set (The Cancer Ge- 
nome Atlas Research Network 2011). This result Is potentially 
meaningful If one considers the possibility that RBI mutations 
could be driver events that act Independently from TP53. It Is well 
known that the rate of TP53 mutations in ovarian cancer Is very 



Table 2. PathScan results for the OV data set 
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Pathways in cancer 

Cell adhesion molecules (CAMs) 

Cytokines 
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Cytokine-cytokine receptor interaction 

Regulation of actin cytoskeleton 

GTP-binding proteins 

Cell adhesion molecules (CAMs) 

Tight junction 
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Chemokine signaling pathway 

Antigen processing and presentation 



hsa04000 


4.36 


X 


10" 


-90 


4.36 


X 


10" 


-90 


3.24 


X 


10 


-89 


6.15 


X 


10" 


88 


hsaOSOOO 


3.14 


X 


10" 


-93 


3.15 


X 


10" 


-41 


2.60 


X 


10 


-92 


2.22 


X 


10" 


-39 


hsa04740 


2.61 


X 


10" 


32 


2.61 


X 


10" 


-32 


1.47 


X 


10 


-31 


1.23 


X 


10" 


-30 


hsa04510 


3.03 


X 


10" 


29 


3.03 


X 


10" 


-29 


1.64 


X 


10 


-28 


1.07 


X 


10" 


27 


hsa04080 


i.n 


X 


10" 


23 


1.11 


X 


10" 


-23 


5.80 


X 


10 


-23 


3.13 


X 


10" 


22 


hsa04512 


1.40 


X 


10" 


23 


1.40 


X 


10" 


-23 


7.05 


X 


10 


-23 


3.29 


X 


10" 


22 


hsa04516 


5.78 


X 


10" 


23 


5.78 


X 


10" 


-23 


2.81 


X 


10 


-22 


1.16 


X 


10" 


21 


hsa04090 


1.85 


X 


10" 


-19 


1.85 


X 


10" 


-19 


8.70 


X 


10 


-19 


3.26 


X 


10" 


-18 


hsaOIOOl 


1.76 


X 


10" 


16 


1.76 


X 


10" 


-16 


8.01 


X 


10 


-16 


2.76 


X 


10" 


-15 


hsa01002 


2.05 


X 


10" 


-15 


2.05 


X 


10" 


-15 


9.03 


X 


10 


-15 


2.89 


X 


10" 


-14 


hsa04812 


1.36 


X 


10" 


-14 


1.36 


X 


10" 


-14 


5.81 


X 


10 


-14 


1.74 


X 


10" 


-13 


hsa04121 


1.60 


X 


10" 


-14 


1.60 


X 


10" 


-14 


6.64 


X 


10 


-14 


1.88 


X 


10" 


-13 


hsa04020 


4.24 


X 


10" 


-14 


4.24 


X 


10" 


-14 


1.71 


X 


10 


-1 3 


4.60 


X 


10" 


-13 


hsa05200 


3.26 


X 


10" 


-73 


8.81 


X 


10" 


-14 


2.19 


X 


10 


-72 


8.87 


X 


10" 


-13 


hsa04514 


2.02 


X 


10" 


1 1 


2.02 


X 


10" 


-1 1 


7.91 


X 


10 


-1 1 


1.90 


X 


10" 


-10 


hsa04052 


4.58 


X 


10" 


1 1 


4.58 


X 


10" 


-11 


1.75 


X 


10 


-10 


4.04 


X 


10" 


-10 


hsa03036 


6.16 


X 


10" 


-1 1 


6.16 


X 


10" 


-11 


2.29 


X 


10 


-10 


5.11 


X 


10" 


-10 


hsa04060 


8.46 


X 


10" 


1 1 


8.46 


X 


10" 


-11 


3.05 


X 


10 


-10 


6.26 


X 


10" 


-10 


hsa04810 


8.79 


X 


10" 


-1 1 


8.79 


X 


10" 


-1 1 


3.05 


X 


10 


-10 


6.26 


X 


10" 


-10 


hsa04031 


8.88 


X 


10" 


-11 


8.88 


X 


10" 


-1 1 


3.05 


X 


10 


-10 


6.26 


X 


10" 


-10 


hsa04515 


1.25 


X 


10" 


-10 


1.25 


X 


10" 


-10 


4.20 


X 


10 


-10 


8.39 


X 


10" 


10 


hsa04530 


4.26 


X 


10" 


-09 


4.26 


X 


10" 


-09 


1.40 


X 


10 


-08 


2.73 


X 


10" 


-08 


hsa01003 


1.63 


X 


10" 


-08 


1.63 


X 


10" 


-08 


5.22 


X 


10 


-08 


9.99 


X 


10" 


-08 


hsa04062 


2.38 


X 


10" 


08 


2.38 


X 


10" 


-08 


7.46 


X 


10 


-08 


1.40 


X 


10" 


-07 


hsa04612 


9.47 


X 


10" 


-08 


9.47 


X 


10" 


-08 


2.90 


X 


10 


-07 


5.29 


X 


10" 


-07 



The top 25 significantly mutated pathways as discovered using PathScan are listed here, sorted by the P-value calculated while excluding gene TP53. 



Genome Research 1591 



www.genome.org 



Dees et al. 



whereas patients with somatic BRCAl 
mutations exhibited no such correlation 
(P = 0.308). A boxplot of the ages at di- 
agnosis for the OV samples (Fig. 4) clearly 
shows that the mean age of those samples 
with a germline BRCAl mutation (51.3 
yr) is lower than those samples with ei- 
ther wild-type BRCAl (60.4) or with a so- 
matic BRCAl mutation (63.1 yr). Thus, 
the CCT correctly evaluated the relation- 
ship between germline variants in BRCAl 
and ovarian cancer susceptibility. 



I II 



II II 



316 serous ovarian cancer samples 

Figure 3. Mutation relation analysis. A heat map showing mutations in highly mutated genes for all 
316 OV samples. Dotted-line boxes highlight concurrent nonsynonymous EMR3 and FATS mutations 
(two concurrent mutations out of five nonsynonymous mutations from EMR3 and 1 8 from FAT3, P = 
0.0333) and mutually exclusive nonsynonymous RBI and TP53 mutations (297 mutually exclusive 
mutations out of six nonsynonymous mutations from RBI and 299 from TP53, P = 0.0141). 



high (Ahmed et al. 2010), including the OV data set used in this 
analysis (The Cancer Genome Atlas Research Network 2011). How- 
ever, for the few cases that do «ot have a driver mutation In TPS 3, we 
speculate, based on our mutual exclusion results, that mutations in 
RBI may represent an independent path to ovarian adenocarci- 
noma. Of course, due to the small numbers of mutations present 
in RBI and EMR3 (six and five mutations, respectively), additional 
data would be required to confirm any hypotheses generated using 
these results. 

Clinical correlation test 

The clinical correlation test (CCT) can be used to determine re- 
lationships between clinical phenotypes and observed mutations. 
The input clinical data may be represented in either numeric or 
categorical ("class") formats. For example, in the OV data set, we 
obtained clinical data for 315 of the 316 OV samples; the numeric 
clinical data consisted of the patients' ages at disease diagnosis and 
also their survival periods (in days), and the categorical clinical 
data for the OV data set included information about a sample's 
race, tumor stage, tumor grade, the outcome of the primary ther- 
apy, and, lastly, their vital status. For both data types, the goal of 
the CCT is to determine whether specific mutations/genes are as- 
sociated with a particular clinical feature. As these associations can 
sometimes be biased by covariate clinical features, MuSiC also of- 
fers a generalized linear model (GLM) analysis option within the 
CCT. This tool allows users to define any number of clinical traits as 
covariates to discovered mutations and, subsequently, to eliminate 
any possible biases introduced to the phenotype/mutation asso- 
ciations by the covariates' effects. 

As a proof of principle, we have assessed the well-established 
relationship between the presence of a BRCAl germline variant 
and a patient's age at disease diagnosis (Hall et al. 1990; Miki et al. 
1994). In the OV data set, the CCT revealed that patients with 
germline BRCAl variants were significantly correlated with earlier 
disease diagnosis (P = 2.456 x 10"^, Wilcoxon rank sum test). 



Proximity analysis 

In certain genes, mutations tend to duster 
in close proximity within fimctional do- 
mains. In order to find these dense "clus- 
ters" of mutations within a mutation list, we 
have developed MuSiC's proximity analysis 
module. This module searches within fixed 
windows around each mutation, reporting 
the number of and distances to all neigh- 
boring mutations. The size of the fixed 
windows utilized for searching is user- 
configurable. In order to determine an ap- 
propriate default size for these windows, we 
have analyzed the distances between all neighboring mutations in 
version 54 of the COSMIC database (see Methods). Upon finding that 
over 25% of the nearest-neighbor mutations in COSMIC are within 
seven (or less) amino acids of each other, we chose to search 7 aa both 
upstream of and downstream from each OV mutation for dense 
clusters of variants. TPS3, with 302 total nonsilent somatic muta- 
tions in the OV data set, dominated the proximity analysis results. 
The average number of mutations within 7 aa of another rP53 
mutation was 4.9, with the densest 14-aa window containing 26 
nonsynonymous mutations. 
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Figure 4. BRCAl variant status versus sample age of diagnosis for the OV 
data set. A boxplot of the age of diagnosis of 31 5 OV patients grouped by 
their BRCAl mutation status. Germline BRCAl variant status is correlated 
with a lower age of diagnosis via the CCT (P = 2.456 x 1 0"^). 
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The next-densest group of mutations occurred in DNAH5, 
where there were four mutations within a space of 3 aa. Several 
genes have mutations that occur in triplets within a space of 2 aa, 
including UBR4 and RBICCI. Both UBR4 and RBICCI have re- 
lationships with RBI, a gene on the significantly mutated gene list 
for the OV data set and a gene also found to harbor copy-number 
alterations in the OV data set (The Cancer Genome Atlas Research 
Network 2011). UBR4 is a component of the N-end rule pathway 
that interacts with RBI, and RBICCI actually regulates the expres- 
sion of RBI. RBI itself has two mutations in close proximity (within 
1 aa of each other). These high-density groups of _RBi -related mu- 
tations, pictured in Figure 5, may support the hypothesis that RBI 
could be an additional driver of ovarian cancer. 

COSMIC/OMIM query 

Using the COSMIC/OMIM module of MuSiC, we attempted to find 
previously reported mutations matching the query set of somatic 
OV mutations. This type of analysis can provide a measure of re- 
currence, as these databases generally contain information about 
the studies from which their contents were derived, and the COSMIC 
database deals exclusively with the somatic mutations discov- 
ered in cancer studies. A summary of COSMIC/OMIM database 
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comparisons for those significantly mutated genes listed above 
with at least one database match of any type is presented in Sup- 
plemental Figure S2. This summary, however, represents only 
a subset of all of the information made available via the database 
queries. We found 15 exact matches in genomic position and nu- 
cleotide change between the OV data set and COSMIC, including 
sites in NFl, RBI, and PIK3CA, all considered significantly mutated 
genes by the MuSiC pipeline. This type of match is only possible 
when comparing to the COSMIC database, since OMIM entries 
contain only amino acid coordinates. We identified another exact 
match in the FOXGl gene, which encodes a forkhead transcription 
factor. Not only was the FOXMl transcription factor network cited 
as significantly altered in 87% of samples in the previous TCGA 
study (The Cancer Genome Atlas Research Network 2011), but, 
additionally, some forkhead transcription factors were previously 
identified as therapeutic targets (Moumne et al. 2008; Wang et al. 
2010b). 

In addition to finding the above COSMIC variants which 
shared positions and identical nucleotide changes with OV mu- 
tations, our comparison of OV mutations to the COSMIC and 
OMIM databases also identified a large set of database mutations 
that altered the same amino acid in an identical manner as an OV 
mutation. Of 233 such matches from COSMIC (76 from OMIM), 
the overwhelming majority of these, 219 
(56), were from the gene TP53. There 
5^ were 229 (76) other "position" matches, 

I defined as mutations which affect an iden- 

tically positioned amino acid but which 
do not cause the same residue change as 
Scale (AA) the previously reported mutation. Most 
of these matches are from genes that have 
been previously associated with ovarian 
cancer (The Cancer Genome Atlas Research 
Network 2011), such as BRAF, BRCAl, 
□ zf-uBR KRAS, and again, TPS3. 
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Figure S. Proximity analysis mutation diagrams. These mutation diagrams show recurrent triplet 
mutations in both UBR4 and RBICCI , both of which harbor a relationship with tumor suppressor gene 
RBI. 



The Pfam annotation module of MuSiC 
groups genes based on the frequency of 
mutation in specific protein domains. 
Grouping mutations by their protein do- 
main can serve to group genes according 
to putative function, since genes that share 
a domain are more likely to share related 
functions. We performed Pfam annotation 
on the 19,356 somatic variants identified 
in the 316 OV cases. Supplemental Table 
SI reports the number of nonsynonymous 
mutations, synonymous mutations, and 
also the number of genes harboring mu- 
tations in each Pfam domain with at least 
five somatic events in the OV data set. 

In the analysis of the OV data, many 
of the frequently mutated domains are also 
the most prevalent domains in the genome, 
including the seven-transmembrane G 
protein-coupled receptor domain, the pro- 
tein kinase domain, and the zinc finger 
domain, as illustrated in Figure 6A. This 
genome-wide abundance is not true, how- 
ever, for the amply mutated P53 domain. 
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Figure 6. Pfam domains affected by OV mutations 

domains in the OV data set next to the number of genes affected in each domain. (6) A stacl<ed bar- 
graph where the value 100% represents the total number of mutations in a particular Pfam domain 
Lighter and darker sections of the bars represent which proportions of the total mutations are non 
synonymous and synonymous, respectively. 



Analysis of this Pfam annotation result correctly confirms the 
significance of the P53 domain mutations in this cohort. Figure 6A 
also shows that this domain is recurrently mutated but only in 
a small number of genes, much different from all of the other 
domains pictured. And lastly. Figure 6B shows that the P53 domain 
has an unusually high nonsynonymous:synonymous mutation 
ratio. All of these details are indicative (correctly) of an important 
mutation hotspot in this cohort. 

The Pfam annotation module output may also be modified 
slightly and fed into the SMG test algorithm in order to produce 
a mathematical result describing "significantly mutated domains" 
(rather than significantly mutated genes) and their associated 
P-values. For a detailed explanation of this option, please see the 
Supplemental Material, including Supplemental Table S2. The re- 
sults presented therein reaffirm the significance of the mutations 
in the PS 3 domain, as well as in the other frequently mutated 
domains. 

Discussion 

There are several software tools available purporting to determine 
mutational significance. CHASM (Carter et al. 2010), for instance, 
uses a machine-learning technique to distinguish driver mutations 
from passengers based on a driver/passenger mutation training set. 
Mutation Assessor (Reva et al. 2011) provides a prediction of the 
functional impact of a mutation based on the specificity of mul- 
tiple sequence alignments and conservation scores. And tools such 
as CanPredict (Kaminker et al. 2007) are database-driven, deriving 
multiple metrics for each variant, starting with stored information 
and models, and then making use of the metrics through a de- 
cision-tree analysis. Other tools, such as ANNOVAR (Wang et al. 
2010a), provide detailed annotations of genetic alterations, much 
like MuSiC's Pfam annotation module. Although not currently 
publicly available, the "Firehose" pipeline does share some fea- 
tures with MuSiC, including an SMG analysis tool and a pathway 
analysis tool, PARADIGM (Vaske et al. 2010), but Firehose as a 
whole is heavily focused on orderly sequencing and quality control 
rather than post-discovery variant analysis. 

MuSiC, on the other hand, is the first available set of com- 
bined tools that enables a complete, multidimensional statistical 
evaluation of next-generation-derived cancer data sets. No other 
publicly available tool suite currently incorporates clinical data 
along with coverage data and database references into the 



determination of the most significant 
mutations among a large mutation list full 
of passengers. MuSiC merges several meth- 
odological aspects of the above-mentioned 
tools with many novel additional algo- 
rithms, providing the capability of large- 
cohort, data-driven statistical analysis to 
the entire research community. 

Use of MuSiC is straightforward. The 
simplicity of the input files and tool 
updating make this package extremely 
accessible. MuSiC also accommodates both 
large and small research organizations; 
although the entire suite of tools is capa- 
ble of running sequentially on a single 
processor, the most CPU-intensive mod- 
ules offer easily parsable options for par- 
allelizing jobs across multiple machines 
or across a job cluster. 
Future support for the MuSiC package will be devoted to the 
handling of additional file formats, such as the variant call format 
(VCF), and also to the development of a graphical user interface 
(GUI). We intend to design new tools aimed at incorporating 
a wider variety of biological and variant data types, such as copy 
alteration data, and 3D protein structures from RCSB's Protein Data 
Bank (Berman et al. 2000), to be used for improved proximity 
analysis calculations. We also intend to implement recurrence 
tests across other data entities, such as significantly mutated 
transcripts and significantly mutated gene families. We also plan 
to enhance the assessment of the functional impact of pro- 
tein mutations in MuSiC through the integration of published 
solutions, such as Mutation Assessor (Reva et al. 2011) and 
PolyPhen-2 (Adzhubei et al. 2010), as well as through the design 
of new modules which take advantage of databases that cate- 
gorize such effects, such as SIFT (Kumar et al. 2009). And lastly, a 
fuller integration of the results from various MuSiC modules, 
many of which are currently considered in isolation, wiU provide an 
even more comprehensive picture of cancer genomic mechanisms. 

Methods 

Sample data set used in thiis study 

Alignment mapping files for the ovarian cancer cohort, as well as 
all mutational data in mutation annotation format (MAF) are 
available at The Cancer Genome Atlas (TCGA) website, http:// 
cancergenome.nih.gov/, via dbGaP. The dbGaP study accession 
number is phs000178.vl.pl. 

Significantly mutated gene tests 

We describe our calculation of background mutation rates and 
three methods for calculating summarized P-values below. 

Background mutation rate (BMR) 

Calculations of BMR, although simply described as the number of 
mutated bases per total bases, are often controversial. Our method 
is dependent upon the data available, under the premise that the 
number of bases having available alignment data should provide 
the denominator for the BMR calculation, since this number also 
provides the upper limit to the number of bases available for mu- 
tation discovery. Therefore, we first count the number of bases 
with sufficient aligned read-depth based upon user-defined coverage 
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limits set Independently for the tumor and normal BAM files for each 
sample in the cohort. For the purposes of our algorithm, coimts are 
determined for six specific reference sequence-based denominations, 
with separate counts for each of A and T bases, CpG-connected C and 
G bases, and lastly C and G bases not connected as CpG. We also 
categorize the discovered mutations along the lines of mutational 
mechanism, with separate categories for AT transitions, AT trans- 
versions, CpG transitions, CpG transversions, CG (non-CpG) tran- 
sitions and transversions, and a seventh "indel" category, for which 
we use the entirety of the covered space in the sample-set when 
comparing indel-affected bases versus available bases. In order to 
calculate the BMR of each mutational mechanism category, the 
number of mutations found in that category is divided by the 
total number of bases available in which such a call could have 
been made. 

MuSiC allows for the calculation of BMR and for the sub- 
sequent comparisons of mutation rate to BMR, to be performed 
separately for each subgroup in a user-specified number of sub- 
groups. If the user specifies that only one subgroup is to be con- 
sidered (the default option), then the entire cohort is used in the 
calculations of mutation rate and BMR for the seven mutational 
mechanism categories. In this case, seven P-values are generated 
for each gene and are summarized using the three methods de- 
scribed below. However, if more than one subgroup is to be con- 
sidered for a given cohort, the mutation rates and BMRs for the 
seven mutational mechanism categories are calculated separately 
for each subgroup (seven mutation rate and BMR calculations per 
subgroup), with P-values generated for each category within each 
subgroup. In this case, the number of P-values generated for each 
gene will equal seven times the number of subgroups to be con- 
sidered. All P-values for each gene are still summarized using the 
three methods discussed below. 

If the number of subgroups to be considered is greater than one 
but not equal to the number of samples in the cohort, the BMR for 
each sample is first calculated individually, and then these per- 
sample BMR values are clustered using k-means clustering into the 
number of subgroups specified by the user. These clustered sub- 
groups are then used for all subsequent mutation rate and BMR 
calculations. Alternatively, if the number of subgroups specified by 
the user equals the number of samples in the cohort, then each 
sample is considered independently for each mutation rate and BMR 
measurement. 

The BMRs calculated across the entire OV data set for each 
mutational mechanism category, including the overall BMR across 
all categories, are plotted in Figure 2A. 

Fisher's combined ?-value test (FCPT) 

FCPT combines all P-values for a particular gene into a statistic, Xo 
according to Fisher's method (Fisher 1925), 

Xc= - 2ilog(Pi), 

where pi is the P-value obtained via binomial distribution for the 
i-th subgroup mutational mechanism type, and k the number of 
subgroup mutational mechanism categories for a gene. The final 
P-value for the entire gene is calculated as the probability of 
observing a value no less than Xc, based onax^ distribution with 2k 
degrees of freedom. 

Likelihood ratio test [LRT) 

LRT constructs a likelihood ratio-based statistic (xj) for a gene, 

fLiMuCi\ri)\ 



where M,-, Q, Rt, and r,- are mutation number, coverage, BMR, and 
maximum likelihood estimate (MLE) of the mutation rate, re- 
spectively, for the i-th subgroup mutational mechanism category 
of a gene, k is the number of mutation types, and I() is the like- 
lihood of observed mutation number for the i-th subgroup muta- 
tional mechanism category, defined as the point probability of 
observing Mj mutations given a coverage of C; and a mutation 
rate of R,- or r,-. The final P-value for the entire gene is calculated as 
the probability of observing a value no less than xi, based on an 
approximate x^ distribution with k degrees of freedom. 

Convolution test (CT) 

CT calculates a summarized log statistic of joint binomial point 
probability, 

Sg= - tlog{L{Mi,Q\Ri)), 

where M,-, C,-, Ri, k, and LQ are referred to as the same as in the LRT 
method. Getz et al. (2007) proposed that the P-value for a gene can 
be calculated by taking one minus a left-tail probability, i.e., the 
probability of observing a value less than S^,, and the semiexact 
distribution of can be obtained by a binned histogram-based 
convolution procedure. This procedure is advantageous for the 
large amounts of data involved in genome-wide investigation of 
cohort mutations because it provides exact P-values in a minimum 
amount of computation time; usually one must choose between 
precision and length of compute. However, one disadvantage of 
this convolution procedure is that the two tails of the Sg distiibu- 
tion are mixed, and therefore the directioncility is lost. In this case, 
both improbably large mutation rates and improbably low muta- 
tion rates give high values for Sg. To remedy this issue, our CT 
method modifies the published convolution procedure by ex- 
cluding genes with mutation rates that are close to or equal to 0. 
In other words, we exclude genes whose mutation rates are ex- 
tremely far below the BMR, as they never would be considered 
"signiflcantiy mutated" in practice. 

The output file for the SMG test is a compilation of the 
P-values for each test for each gene under the null hypothesis that 
the number of mutations seen in the gene is in accordance with 
those seen in the background. False discovery rates are also 
reported for each test for each gene. 

Significantly mutated patliway/gene set analysis 

Sequence data from a single cancer genome do not contain sufficient 
information to adequately investigate pathway/phenotype associa- 
tions. This problem requires systematic analysis of larger cohorts. 
There are several published methods for identifying signlficantiy 
mutated pathways or gene sets (e.g., Lm et al. 2007; Tarca et al. 2009; 
Vaske et al. 2010), but we have incorporated a new tool called 
PathScan into the MuSiC package. PathScan accounts for two 
important factors other methods neglect: (1) variations in gene 
lengths and the consequent differences of their mutational likeli- 
hoods under the null hypothesis; and (2) distribution of mutations 
among samples and their proper combination into an overall 
P-value. 

We have configured PathScan in MuSiC to perform pathway 
analysis using a wide variety of annotated databases, including 
KEGG (Kanehisa and Goto 2000), BioCarta (Nishimura 2001), and 
Reactome (Joshi-Tope et al. 2005), as mentioned above. The results 

described above where TP53 is excluded from the pathway analysis 
are achieved easily through MuSiC's implementation of a parame- 
ter called "genes-to-ignore," in which the user may provide 
a comma-delimited list of genes whose mutations should be skipped 
over when reading the MAF file during the analysis. 
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Mutation relation test (MRT) 

We have developed the MRT to look for any latent relationships 
among mutated genes. This module is a correlation test for two 
binary variables to see whether or not any two genes are mutated 
concurrently (positive correlation) or exclusively (negative corre- 
lation). Because the numbers of mutated genes may vary signifi- 
cantly among samples (most have only a few mutations whereas 
some may have many), and because direct correlation between 
genes in high-mutatlon-count samples is not comparable to that in 
low-mutation-count samples, classic correlation analysis is invalid. 
To control for this confounder, we calculate P-values in the MRT 
through a restricted permutation, taking into account the distri- 
bution of mutated gene numbers amongt the samples. Permuta- 
tions are performed, therefore, by first calculating the numbers of 
mutated genes in individual samples and then randomly per- 
muting the observed mutations, keeping both the number of 
mutations in each gene and also the distribution of mutated gene 
number per sample constant. Concurrence and exclusion among 
mutations are tested separately. This method has been successfully 
applied to lung cancer data in our previous study (Ding et al. 2008). 

Clinical correlation test (CCT) 

The CCT module was developed for detecting correlations between 
mutations and clinical features. Mutation data are again treated as 
0-1 variables, where all mutations occurring within the same gene 
are grouped (the default method), or optionally, where a more 
strict definition of collapsing is used, grouping only those muta- 
tions with identical genomic coordinates and nucleotide changes. 
Clinical features can either be categorical or numerical variables, 
where a Fisher's exact test (and optionally, a test) is used to 
calculate P-values for categorical clinical data, and a Wllcoxon rank 
sum test (and optionally, Pearson's correlation) is used to analyze 
numerical clinical data. The null model for the calculated P-values 
depends on the type of clinical data being analyzed. For categorical 
clinical data, the null model is that mutations happen randomly in 
the various categories represented across the samples, with no 
preference given to any one category. For numerical clinical data, 
the null model is that mutations have no effects on a trait (i.e., that 
mutations will not increase or decrease the value of the numeri- 
cally reported trait). 

A GLM analysis option also exists in which a user may define 
an analysis model which must Include a response variable and 
a variant, and then any number of covariates may be considered. 
The variants and covariates may all be clinical features, or, more 
typically, genes from the mutation list should be listed as variants, 
with clinical features making up the covariates. The output of the 
GLM option includes a P-value which indicates association be- 
tween the response variable and the variant, as well as other 
standard outputs from a deviance analysis. 

Proximity analysis 

In MuSiC, we have defined proximity analysis as an investigation 
of the density of mutations across a cohort in the amino acid space 
of a given gene's annotated transcript. The goal of this analysis is to 
extract evidence of mutations clustered within specific domains, 
an event which is significant in light of the null model where point 
mutations occur at random locations throughout the genome. 
Highly mutated domains across a group of samples might be in- 
dicative of an underlying mechanistic association which contrib- 
utes to the shared condition of the cohort, such as the onset of 
disease. The specific action taken in the tool is to query the dis- 
tance (in number of amino acids) between every pair of mutations 



on a given transcript of a mutated gene within the sample set, and 
then to determine which mutations fall within "close proximity." 
"Close proximity" may be determined by the user as an input to 
the module, with the units of this parameter being the number of 
sequential amino acids between two events. 

The algorithm operates as follows: First, the amino acid po- 
sition of each mutation within its respective transcript is de- 
termined. Then, for each mutation in the input MAF file, two 
values are calculated: (1) the number of other mutations on the 
same transcript within the proximity limit set by the user; and (2) 
the distance to the closest other mutation in this nearby set. (Note 
that if more than one mutation occurs within the same amino 
acid, the dlstcince to the closest mutation in this case would be 
0 aa.) Then, for each mutation, we report a few general details from 
the MAF file in the output, such as the genomic coordinates, ref- 
erence and variant bases, and the sample in which the mutation 
was found. We further report the calculated values, such as the 
amino acid position of the mutation within the transcript, and also 
(1) and (2) listed directly above. 

In the Proximity Analysis Results section, we chose to use 7 aa 
as the maximum distance from any OV mutation within which to 
look for clusters of mutations. This default distance was chosen 
after querying the entire COSMIC mutation database (version 54, 
comprising 171,473 mutations), and looking for the proximity of 
mutations found in dense clusters. For each gene present in the 
database, we calculated the distance between each mutation in 
that gene and its closest neighboring mutation. As many muta- 
tions sharing the exact same genomic coordinates are reported 
more than once in COSMIC, we only used one Instance of each 
mutational position when calculating nearest-neighbor distances. 
Through these per-gene measurements, we found that over 25% of 
all neighboring mutations in COSMIC are within 7 aa of each other 
(25% thus represents the first quartile of nearest-neighbor dis- 
tances). We thus set 7 aa as the default range within which to 
search for clusters of mutations in the proximity analysis module. 

COSMIC/OMIM query 

In the COSMIC/OMIM tool, a list of mutations is queried for re- 
currence against both the Catalogue of Somatic Mutations in 
Cancer (COSMIC) database and the Online Mendelian Inheritance 
in Man database (OMIM). The usefulness of this tool is twofold; 
first, recurrent events between these two databases and a list of 
variations from a cohort provide a quick extension set, especially as 
the databases are expanded over time, and secondly, we have 
downloaded entries from the COSMIC database related to muta- 
tions that report some type of amino acid transformation and 
provide this file with the MuSlC package, making this type of query 
very user-friendly. 

The tool functions as follows: For every mutation found in the 
input MAF file, information related to this mutation is gathered 
from both the COSMIC and OMIM databases. Relevant in- 
formation is ascertained by relating the genomic coordinate 
(COSMIC) or the amino acid change (COSMIC and OMIM) asso- 
ciated with the variant to all database entries. (Note that the MAF 
used as input must contain an additional field with column header 
"amlno_acld_change," which is the field searched for to perform 
the amino acid comparison.) A database entry must lie within the 
user-specified number of bases (default = 5) or amino acids (default = 
2) of the MAF variant to be considered as a "nearby" match. We 
view the thresholds for "nearby" matches as adequate for taking 
into account the different definitions of reference sequences and 
gene transcripts that may be used by different contributors to the 
databases. Alternatively, "exact" matches are direct overlaps in both 
the location and base/ammo acid change of a variant in the MAF 
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and a mutation found in the database. If only the location (the 
genomic coordinate or amino acid position) of a variant match 
a database entry but not the nucleotide or amino acid change, these 
matches are deemed "position" matches. We currently ignore silent 

mutations in OMIM queries, as these types of mutations do not 
affect protein coding. However, the positions of silent mutations are 
still considered in comparison with the COSMIC database. 

All discoveries based on these database queries are appended 
to the input MAF file as extra columns (one column per database). 
The information written to the output file by the tool will either 
describe "exact" or "position" matches In amino add or base change 
or wlU (optionally) give all results from the gene associated with 
the varlcmt for a "nearby" match, one of which must qualify the 
site as actually having a "nearby" match. If a variant has no match 
In the particular database, a "novel" declaration is made for the 
variant, and (again, optionally) all of the results found from the 
associated gene will be printed next to the declaration to give 
the user an idea of what matches are possible for that gene. And 
lastly, If the gene for a variant Is not found in the database, a message 
noting this circumstance Is printed along with a "novel" declaration 
for the variant. 

For each queried database, the tool further prints an output 

summary which tallies the types of matches found throughout the 
entire data set using that database. The user can learn, for instance, 
how many sites matched exactly in both nucleotide and position, 
or perhaps how many variants matched database entries in only 
the amino acid category (labeled AA in the summary) while not 
matching In exact position or nucleotide change (labeled NT). 

Pfam annotation 

Our Pfam annotation tool supposes the use of Sanger's Pfam da- 
tabase (cited above) as another avenue for defining recurrence 
within the mutation list of a large group of cancer samples. The 
Pfam database is a catalog of functional regions (or domains) present 
within protein sequences. Referencing this list with the locations of 
somatic mutations across a cohort provides knowledge about which 
domcilns are most frequently affected by mutations In a certain dis- 
ease and, therefore, may provide some Insight as to the significance 
(or insignificance) of the mutations in question. 

In order to use the Pfam database, we have translated the 
catalog's amino acid coordinate system into a genomic position 
coordinate system. The amino acid sequences chosen to be trans- 
lated were ranked at every coding and splice site position in the 
genome, based first on the translatlonal and functional effect 
predicted by the transcript at a given position, and then secondly 
on the NCBI status of the particular transcript. The sequence of the 
"best" trcinscrlpt chosen at each site was used to recover Pfam do- 
mains from Sanger's database. We have Incorporated this Information 
into a table used by the tool via a fast-lookup program called Tabix, 
available in the SAMtools package (Li et al. 2009). The Pfam anno- 
tation module uses the genetic coordinates of a variant to append 
a Pfam annotation domain column to the end of any MAF file. 

This tool has been expanded to query a few other protein 
annotation databases In addition to Pfam. We provide information 
from the SUPERFAMILY database (Wilson et al. 2009), the SMART 
database (Schultz et al. 2000), and the Pattemscan database In our 
annotation query. (Pattemscan Is a new version of the Proslte 
database [Slgrlst et al. 2010].) Pattemscan provides information 
similar to that of the Pfam database regarding protein domains and 
families at functional sites, but SUPERFAMILY and SMART focus on 
evolved protein structural families and signaling protein domains, 
respectively. All of this information will be present in the output 
file, comma delimited, with the name of the database separated 
from the annotation result by an underscore. 



Software availability 

The MuSiC software tools, source code, and reference data are ac- 
cessible through our website, http://gmt.genome.wustl.edu, and 
are supported on Ubuntu Linux 10.04 (Lucid Lynx). Installation 
on all Deblan-based systems will Initiate automatic updates from 
our software server. The MuSlC suite Is also available on CPAN 
(http://www.cpan.org) and GltHub (https://glthub.com) imder the 
namespace "Genome" in both locations, and Integration of MuSlC 
Into Galaxy (http://usegalaxy.org) Is also available. A summary of 
benchmarking statistics for each MuSlC module Is available In the 
Supplemental Material. 
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